Chapter 1: About the project

The course focuses on important statistical techniques used in research. It is hands-on: I get to learn how to use various tools such as R and GitHub. The main theme is open science, the new paradigm being promoted in today's research world.

This is the link to my repository


Chapter 2: Regression and model validation


This week focused on data wrangling, regression analysis and model validation. In a survey, students were asked questions to understand the approaches they use in learning. Based on these, the following variables were extracted: gender, age, examination points, strategic approach, deep learning approach, attitude and surface learning approach.

The results of my analysis showed that there were about twice as many female respondents as male respondents. Overall, attitude seems to be the strongest determinant of students' success in their examination. The strategic learning approach also seems to contribute to success to some extent. While age showed a very small negative effect, it cannot be concluded to be a major factor, as most of the respondents were young students below 26.

To validate my model, I used the Q-Q plot, the residuals vs fitted values plot and the residuals vs leverage plot. In choosing my predictor variables, I examined the standard error, the multiple R-squared (coefficient of determination) and the p-value to check the significance of variables. The ggpairs() function also came in handy for getting a broad overview of the distribution of the variables.

#read the students2014 data

lrn14<- read.table("C:/Users/oyeda/Desktop/OPEN_DATA_SCIENCE/IODS-project/data/learning2014.txt", header=T, sep = "\t")

#Alternatively, I could also read the already wrangled data as below:
#lrn14<- read.table("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/learning2014.txt", header=T, sep = ",")


#structure of the data
str(lrn14)
## 'data.frame':    166 obs. of  7 variables:
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 4.75 3.83 3.25 4.33 4 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 2.42 1.92 2.83 2.17 3 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 3.62 2.25 4 4.25 3.5 ...
##  $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...
##  $ gender  : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
#dimension of the data
dim(lrn14)
## [1] 166   7

The data includes 7 variables which describe several attributes of the students:

#load packages
library(ggplot2)
#install.packages("GGally")
library(GGally)
#create a pair plot to see the broad overview of the data
ggpairs(lrn14, aes(col=gender), lower = list(combo = wrap("facethist", bins = 20)))

#summary of the data
summary(lrn14)
##       deep           surf            stra            age       
##  Min.   :1.58   Min.   :1.580   Min.   :1.250   Min.   :17.00  
##  1st Qu.:3.33   1st Qu.:2.420   1st Qu.:2.620   1st Qu.:21.00  
##  Median :3.67   Median :2.830   Median :3.185   Median :22.00  
##  Mean   :3.68   Mean   :2.787   Mean   :3.121   Mean   :25.51  
##  3rd Qu.:4.08   3rd Qu.:3.170   3rd Qu.:3.620   3rd Qu.:27.00  
##  Max.   :4.92   Max.   :4.330   Max.   :5.000   Max.   :55.00  
##     attitude         points      gender 
##  Min.   :1.400   Min.   : 7.00   F:110  
##  1st Qu.:2.600   1st Qu.:19.00   M: 56  
##  Median :3.200   Median :23.00          
##  Mean   :3.143   Mean   :22.72          
##  3rd Qu.:3.700   3rd Qu.:27.75          
##  Max.   :5.000   Max.   :33.00

By and large, we have almost twice as many female respondents as male ones among the 166 students (those with an examination score of 0 were excluded). The average age is about 26 years, with the youngest student being 17 and the oldest 55. The age distribution is skewed to the right, showing that the respondents are mostly young students below 26. We can also see from the box plot that age is non-normally distributed and right-skewed, with some outliers.

Attitude, points and the deep learning approach also have a few outliers. Generally, the pairs show weak correlations. However, attitude seems to be the variable most strongly correlated (although still weakly) with points, for both males and females. Men generally seem to have a slightly better attitude than their female counterparts, although some male students show a very poor attitude towards learning.

Both genders also appear to range from average to very deep learning approaches. Female students have a slightly more strategic approach. However, there appears to be little distinction between the examination points of male and female students. Most students scored above average, while a few performed very poorly.

Trying out the model using all the variables as predictors:

l_model<-lm(points~gender+age+attitude+deep+stra+surf, data = lrn14)
summary(l_model)
## 
## Call:
## lm(formula = points ~ gender + age + attitude + deep + stra + 
##     surf, data = lrn14)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.4513  -3.2513   0.3836   3.5240  11.4869 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 18.32352    5.25571   3.486 0.000633 ***
## genderM     -0.06009    0.92983  -0.065 0.948551    
## age         -0.09606    0.05396  -1.780 0.076916 .  
## attitude     3.44426    0.59775   5.762 4.19e-08 ***
## deep        -1.04826    0.78408  -1.337 0.183155    
## stra         0.95823    0.55156   1.737 0.084271 .  
## surf        -1.11001    0.84433  -1.315 0.190519    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.265 on 159 degrees of freedom
## Multiple R-squared:  0.2312, Adjusted R-squared:  0.2022 
## F-statistic: 7.971 on 6 and 159 DF,  p-value: 1.588e-07

I first used all the variables in the model, and then applied stepwise regression to help remove the redundant variables.

library(MASS)   #load package
stepw_reg<-stepAIC(l_model, direction = "both")
## Start:  AIC=558.34
## points ~ gender + age + attitude + deep + stra + surf
## 
##            Df Sum of Sq    RSS    AIC
## - gender    1      0.12 4408.0 556.35
## - surf      1     47.91 4455.8 558.14
## - deep      1     49.55 4457.5 558.20
## <none>                  4407.9 558.34
## - stra      1     83.67 4491.6 559.46
## - age       1     87.88 4495.8 559.62
## - attitude  1    920.43 5328.3 587.82
## 
## Step:  AIC=556.35
## points ~ age + attitude + deep + stra + surf
## 
##            Df Sum of Sq    RSS    AIC
## - surf      1     47.82 4455.8 556.14
## - deep      1     49.66 4457.7 556.21
## <none>                  4408.0 556.35
## - stra      1     88.28 4496.3 557.64
## - age       1     90.18 4498.2 557.71
## + gender    1      0.12 4407.9 558.34
## - attitude  1    999.58 5407.6 588.27
## 
## Step:  AIC=556.14
## points ~ age + attitude + deep + stra
## 
##            Df Sum of Sq    RSS    AIC
## - deep      1     27.03 4482.9 555.14
## <none>                  4455.8 556.14
## + surf      1     47.82 4408.0 556.35
## - age       1     75.38 4531.2 556.92
## - stra      1    106.09 4561.9 558.04
## + gender    1      0.02 4455.8 558.14
## - attitude  1   1085.20 5541.0 590.32
## 
## Step:  AIC=555.14
## points ~ age + attitude + stra
## 
##            Df Sum of Sq    RSS    AIC
## <none>                  4482.9 555.14
## - age       1     76.60 4559.5 555.95
## + deep      1     27.03 4455.8 556.14
## + surf      1     25.19 4457.7 556.21
## - stra      1     97.54 4580.4 556.71
## + gender    1      0.01 4482.9 557.14
## - attitude  1   1061.20 5544.1 588.41

After stepwise regression, I ended up with three explanatory variables, namely 'age', 'attitude' and 'stra'.
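To double-check the stepwise result, the AIC of the full and reduced models can also be compared directly with AIC(). A minimal sketch follows, assuming the wrangled data is still available at the course URL used earlier; note that stepAIC() reports AIC only up to an additive constant, so its printed values differ from AIC()'s, but the ranking of the models agrees.

```r
# Read the wrangled data (assumption: the course URL is still reachable)
lrn14 <- read.table("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/learning2014.txt",
                    header = TRUE, sep = ",")
# Fit the full model and the stepwise-selected model
l_model  <- lm(points ~ gender + age + attitude + deep + stra + surf, data = lrn14)
l_model2 <- lm(points ~ age + attitude + stra, data = lrn14)
# Compare them directly: the smaller AIC marks the preferred model
AIC(l_model, l_model2)
```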

Now, let’s refit the model with these three variables.

l_model2<-lm(points ~ age + attitude + stra, data = lrn14)
summary(l_model2)
## 
## Call:
## lm(formula = points ~ age + attitude + stra, data = lrn14)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18.1199  -3.2006   0.3307   3.4131  10.7605 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 10.89473    2.64895   4.113 6.20e-05 ***
## age         -0.08821    0.05302  -1.664   0.0981 .  
## attitude     3.48143    0.56219   6.193 4.69e-09 ***
## stra         1.00324    0.53436   1.877   0.0623 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.26 on 162 degrees of freedom
## Multiple R-squared:  0.2182, Adjusted R-squared:  0.2037 
## F-statistic: 15.07 on 3 and 162 DF,  p-value: 1.072e-08

As shown by the model summary, the p-value of 'attitude' is well below 0.05, so we can be confident that 'attitude' significantly affects the examination points. However, the coefficient of determination (multiple R-squared) of the model is just 0.2182 (21.82%), which is relatively low. The standard error of age is quite low, but age seems insignificant. Just as points and age showed a low negative correlation earlier, age has a slight negative effect (-0.08821) on the examination points, while attitude, which showed a somewhat stronger positive correlation with points earlier, has a positive effect (3.48143) on them.

I will remove age and the strategic approach because they are insignificant, with p-values higher than 0.05.

Finally, I am left with one predictor, 'attitude', which seems to have a positive effect on the exam points.

#Final model
lm_final<-lm(points~attitude, data = lrn14)
summary(lm_final)
## 
## Call:
## lm(formula = points ~ attitude, data = lrn14)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 1.95e-09 ***
## attitude      3.5255     0.5674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

The coefficient of determination (multiple R-squared) decreased only slightly, from 21.82% to 19.06%, and the residual standard error increased only from 5.26 to 5.32; the difference is not considerable. The p-value is also well below 0.05.

Thus, I decided to use 'attitude' alone. This is in accordance with the principle of parsimony, whereby it is advisable to use as few predictors as possible when building models.
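With the estimates printed in the final summary above, the fitted model is simply points ≈ 11.64 + 3.53 × attitude. A worked example with a hypothetical student (attitude = 4 on the 1-5 scale), using the rounded coefficients as printed:

```r
# Coefficients as printed in summary(lm_final)
intercept  <- 11.6372
b_attitude <- 3.5255
# Predicted exam points for attitude = 4
pred <- intercept + b_attitude * 4
pred   # about 25.7 points
```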

par(mfrow=c(2,2))
plot(lm_final, which = c(1,2,5))

As seen in the normal Q-Q plot, the points fit the line well, aside from a few deviations at the beginning and end. The errors can thus be said to be approximately normally distributed.

In the residuals vs fitted values plot, the spread of the residuals changes only slightly as the fitted values increase, so the constant-variance assumption seems reasonable. In the residuals vs leverage plot, no observation seems to have clear leverage over the others.
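Influence can also be checked numerically with Cook's distance, which the residuals vs leverage plot summarizes graphically. A sketch follows, assuming the same learning2014 data (read here from the course URL); the 4/n cutoff is only a common rule of thumb, not part of the course material.

```r
# Assumption: the course URL is still reachable
lrn14 <- read.table("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/learning2014.txt",
                    header = TRUE, sep = ",")
lm_final <- lm(points ~ attitude, data = lrn14)
# Cook's distance: how much each observation shifts the fitted coefficients
cooks <- cooks.distance(lm_final)
# Flag observations exceeding the rule-of-thumb cutoff 4/n
which(cooks > 4 / nrow(lrn14))
```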


Chapter 3: Logistic regression

The data describes students' secondary education achievement in two Portuguese schools. The data attributes include student grades and demographic, social and school-related features, and it was collected using school reports and questionnaires. Two datasets are provided regarding the performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). In [Cortez and Silva, 2008], the two datasets were modeled under binary/five-level classification and regression tasks. Important note: the target attribute G3 has a strong correlation with attributes G2 and G1. This occurs because G3 is the final year grade (issued in the 3rd period), while G1 and G2 correspond to the 1st and 2nd period grades. It is more difficult to predict G3 without G2 and G1, but such a prediction is much more useful (see the paper source for more details). This is the data source

Data Set Information:

The data was obtained from this source

# access the tidyverse libraries tidyr, dplyr, ggplot2
#install.packages("tidyr")
library(tidyr); library(dplyr); library(ggplot2); library(GGally);

#read the data
alc<- read.table("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/alc.txt", sep = ",", header = T)

DATA OVERVIEW

colnames(alc)   #names of columns
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "nursery"    "internet"   "guardian"   "traveltime"
## [16] "studytime"  "failures"   "schoolsup"  "famsup"     "paid"      
## [21] "activities" "higher"     "romantic"   "famrel"     "freetime"  
## [26] "goout"      "Dalc"       "Walc"       "health"     "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"
summary(alc)    #summary
##  school   sex          age        address famsize   Pstatus
##  GP:342   F:198   Min.   :15.00   R: 81   GT3:278   A: 38  
##  MS: 40   M:184   1st Qu.:16.00   U:301   LE3:104   T:344  
##                   Median :17.00                            
##                   Mean   :16.59                            
##                   3rd Qu.:17.00                            
##                   Max.   :22.00                            
##       Medu            Fedu             Mjob           Fjob    
##  Min.   :0.000   Min.   :0.000   at_home : 53   at_home : 16  
##  1st Qu.:2.000   1st Qu.:2.000   health  : 33   health  : 17  
##  Median :3.000   Median :3.000   other   :138   other   :211  
##  Mean   :2.806   Mean   :2.565   services: 96   services:107  
##  3rd Qu.:4.000   3rd Qu.:4.000   teacher : 62   teacher : 31  
##  Max.   :4.000   Max.   :4.000                                
##         reason    nursery   internet    guardian     traveltime   
##  course    :140   no : 72   no : 58   father: 91   Min.   :1.000  
##  home      :110   yes:310   yes:324   mother:275   1st Qu.:1.000  
##  other     : 34                       other : 16   Median :1.000  
##  reputation: 98                                    Mean   :1.442  
##                                                    3rd Qu.:2.000  
##                                                    Max.   :4.000  
##    studytime        failures      schoolsup famsup     paid     activities
##  Min.   :1.000   Min.   :0.0000   no :331   no :144   no :205   no :181   
##  1st Qu.:1.000   1st Qu.:0.0000   yes: 51   yes:238   yes:177   yes:201   
##  Median :2.000   Median :0.0000                                           
##  Mean   :2.034   Mean   :0.2906                                           
##  3rd Qu.:2.000   3rd Qu.:0.0000                                           
##  Max.   :4.000   Max.   :3.0000                                           
##  higher    romantic      famrel        freetime         goout      
##  no : 18   no :261   Min.   :1.00   Min.   :1.000   Min.   :1.000  
##  yes:364   yes:121   1st Qu.:4.00   1st Qu.:3.000   1st Qu.:2.000  
##                      Median :4.00   Median :3.000   Median :3.000  
##                      Mean   :3.94   Mean   :3.223   Mean   :3.113  
##                      3rd Qu.:5.00   3rd Qu.:4.000   3rd Qu.:4.000  
##                      Max.   :5.00   Max.   :5.000   Max.   :5.000  
##       Dalc            Walc          health         absences     
##  Min.   :1.000   Min.   :1.00   Min.   :1.000   Min.   : 0.000  
##  1st Qu.:1.000   1st Qu.:1.00   1st Qu.:3.000   1st Qu.: 0.000  
##  Median :1.000   Median :2.00   Median :4.000   Median : 3.000  
##  Mean   :1.474   Mean   :2.28   Mean   :3.579   Mean   : 5.319  
##  3rd Qu.:2.000   3rd Qu.:3.00   3rd Qu.:5.000   3rd Qu.: 8.000  
##  Max.   :5.000   Max.   :5.00   Max.   :5.000   Max.   :75.000  
##        G1              G2              G3           alc_use     
##  Min.   : 3.00   Min.   : 0.00   Min.   : 0.00   Min.   :1.000  
##  1st Qu.: 8.00   1st Qu.: 8.25   1st Qu.: 8.00   1st Qu.:1.000  
##  Median :10.50   Median :11.00   Median :11.00   Median :1.500  
##  Mean   :10.86   Mean   :10.71   Mean   :10.39   Mean   :1.877  
##  3rd Qu.:13.00   3rd Qu.:13.00   3rd Qu.:14.00   3rd Qu.:2.500  
##  Max.   :19.00   Max.   :19.00   Max.   :20.00   Max.   :5.000  
##   high_use      
##  Mode :logical  
##  FALSE:270      
##  TRUE :112      
##                 
##                 
## 
dim(alc)    #dimension
## [1] 382  35
str(alc)   #structure
## 'data.frame':    382 obs. of  35 variables:
##  $ school    : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sex       : Factor w/ 2 levels "F","M": 1 1 1 1 1 2 2 1 2 2 ...
##  $ age       : int  18 17 15 15 16 16 16 17 15 15 ...
##  $ address   : Factor w/ 2 levels "R","U": 2 2 2 2 2 2 2 2 2 2 ...
##  $ famsize   : Factor w/ 2 levels "GT3","LE3": 1 1 2 1 1 2 2 1 2 1 ...
##  $ Pstatus   : Factor w/ 2 levels "A","T": 1 2 2 2 2 2 2 1 1 2 ...
##  $ Medu      : int  4 1 1 4 3 4 2 4 3 3 ...
##  $ Fedu      : int  4 1 1 2 3 3 2 4 2 4 ...
##  $ Mjob      : Factor w/ 5 levels "at_home","health",..: 1 1 1 2 3 4 3 3 4 3 ...
##  $ Fjob      : Factor w/ 5 levels "at_home","health",..: 5 3 3 4 3 3 3 5 3 3 ...
##  $ reason    : Factor w/ 4 levels "course","home",..: 1 1 3 2 2 4 2 2 2 2 ...
##  $ nursery   : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 2 2 2 2 ...
##  $ internet  : Factor w/ 2 levels "no","yes": 1 2 2 2 1 2 2 1 2 2 ...
##  $ guardian  : Factor w/ 3 levels "father","mother",..: 2 1 2 2 1 2 2 2 2 2 ...
##  $ traveltime: int  2 1 1 1 1 1 1 2 1 1 ...
##  $ studytime : int  2 2 2 3 2 2 2 2 2 2 ...
##  $ failures  : int  0 0 3 0 0 0 0 0 0 0 ...
##  $ schoolsup : Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 1 2 1 1 ...
##  $ famsup    : Factor w/ 2 levels "no","yes": 1 2 1 2 2 2 1 2 2 2 ...
##  $ paid      : Factor w/ 2 levels "no","yes": 1 1 2 2 2 2 1 1 2 2 ...
##  $ activities: Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 1 1 1 2 ...
##  $ higher    : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ romantic  : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
##  $ famrel    : int  4 5 4 3 4 5 4 4 4 5 ...
##  $ freetime  : int  3 3 3 2 3 4 4 1 2 5 ...
##  $ goout     : int  4 3 2 2 2 2 4 4 2 1 ...
##  $ Dalc      : int  1 1 2 1 1 1 1 1 1 1 ...
##  $ Walc      : int  1 1 3 1 2 2 1 1 1 1 ...
##  $ health    : int  3 3 3 5 5 5 3 1 1 5 ...
##  $ absences  : int  6 4 10 2 4 10 0 6 0 0 ...
##  $ G1        : int  5 5 7 15 6 15 12 6 16 14 ...
##  $ G2        : int  6 5 8 14 10 15 12 5 18 15 ...
##  $ G3        : int  6 6 10 15 10 15 11 6 19 15 ...
##  $ alc_use   : num  1 1 2.5 1 1.5 1.5 1 1 1 1 ...
##  $ high_use  : logi  FALSE FALSE TRUE FALSE FALSE FALSE ...
#attach(alc)   #attach() is discouraged; the data frame is referenced explicitly instead
# (re)define the logical column 'high_use': TRUE when alc_use is greater than 2
alc$high_use<- alc$alc_use > 2
#alc <- mutate(alc, high_use = alc_use > 2) #alternative
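As a quick sanity check of this definition, on made-up values (alc_use is the average of weekday and weekend consumption on a 1-5 scale, so the demo vector here is hypothetical):

```r
# TRUE exactly when average consumption exceeds 2
alc_use_demo  <- c(1, 1.5, 2, 2.5, 5)
high_use_demo <- alc_use_demo > 2
high_use_demo   # FALSE FALSE FALSE TRUE TRUE
```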

Hypotheses.

  • age is not related to alcohol consumption
  • sex is not related to alcohol consumption
  • absence from school is not related to alcohol use
  • freetime after school is not related to alcohol consumption

subsetting and exploring my chosen variables

hyp<- alc[,c("age", "sex",  "absences","freetime","alc_use")]
#exploring my chosen variables
# draw a bar plot of each variable
hyp<-gather(hyp) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free")
hyp + geom_bar()

The plots above show the distribution of the chosen predictors and of the alcohol-consumption scale. The ages of the students are mainly within 15-19, with a few students aged 20 and 22. Generally, there are relatively few heavy alcohol consumers among the students. The students seem to have a fair amount of free time. The respondents include 198 female and 184 male students.

Overview of the chosen predictors and alcohol consumption variable.

hyp<- alc[,c("age", "sex",  "absences","freetime","alc_use")]
#show the overview of the predictors and response variable(non-binomial)
plot_hyp <- ggpairs(hyp, mapping = aes(col=sex, alpha=0.3), lower = list(combo = wrap("facethist", bins = 20)))
plot_hyp   #draw the plot

Here we can see that the female students tend to be somewhat older, and there are both male and female students who are much older than the rest of the sample. Absences are generally low, but there are some chronic absentees among the students (both male and female). Most of the students have a fair amount of free time, while a few are very busy. By and large, the predictors show little correlation with one another, so multicollinearity is not a problem here. However, these predictors also show only very weak correlations with alcohol consumption. Nevertheless, it should be recalled that correlation is not causation; hence, I will dig further to understand the effects of these predictors on alcohol consumption among students and see whether they are significant.
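The multicollinearity claim can be quantified directly with a correlation matrix of the numeric predictors. A minimal sketch, assuming the alc data is still available at the URL used above:

```r
# Assumption: the course URL is still reachable
alc <- read.table("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/alc.txt",
                  sep = ",", header = TRUE)
# Pairwise correlations of the numeric predictors; values near 0 support
# the claim that multicollinearity is not a problem
round(cor(alc[, c("age", "absences", "freetime")]), 2)
```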

Crosstabs

Below are crosstabs that compare the predictors with alcohol consumption.

#using the binomial response(high_use)
#it is also possible to use simple crosstabs e.g
#xtabs(high_use~age, data = alc)
#but below is a more comprehensive crosstab.
#Crosstabs
summarise(group_by(alc, age,high_use), count=n(), mean_grade=mean(G3))
## # A tibble: 12 x 4
## # Groups:   age [?]
##      age high_use count mean_grade
##    <int>    <lgl> <int>      <dbl>
##  1    15    FALSE    63  11.380952
##  2    15     TRUE    18  11.055556
##  3    16    FALSE    79  11.303797
##  4    16     TRUE    28  10.714286
##  5    17    FALSE    65  10.123077
##  6    17     TRUE    35  10.228571
##  7    18    FALSE    55   9.218182
##  8    18     TRUE    26   9.423077
##  9    19    FALSE     7   5.714286
## 10    19     TRUE     4   6.250000
## 11    20    FALSE     1  18.000000
## 12    22     TRUE     1   8.000000
#an alternative and preferable way of doing the above
#alc %>% group_by(age, high_use) %>% summarise(count=n(), mean_grade =  mean(G3))

summarise(group_by(alc, sex,high_use), count=n(), mean_grade=mean(G3))
## # A tibble: 4 x 4
## # Groups:   sex [?]
##      sex high_use count mean_grade
##   <fctr>    <lgl> <int>      <dbl>
## 1      F    FALSE   157   9.687898
## 2      F     TRUE    41  10.414634
## 3      M    FALSE   113  11.610619
## 4      M     TRUE    71   9.971831
summarise(group_by(alc, absences,high_use), count=n(), mean_grade=mean(G3))
## # A tibble: 48 x 4
## # Groups:   absences [?]
##    absences high_use count mean_grade
##       <int>    <lgl> <int>      <dbl>
##  1        0    FALSE    94   8.372340
##  2        0     TRUE    22   8.318182
##  3        1    FALSE     2  12.000000
##  4        1     TRUE     1  15.000000
##  5        2    FALSE    54  12.296296
##  6        2     TRUE    13  11.076923
##  7        3    FALSE     2  11.000000
##  8        3     TRUE     4  13.000000
##  9        4    FALSE    36  11.666667
## 10        4     TRUE    15  10.200000
## # ... with 38 more rows
summarise(group_by(alc, freetime,high_use), count=n(), mean_grade=mean(G3))
## # A tibble: 10 x 4
## # Groups:   freetime [?]
##    freetime high_use count mean_grade
##       <int>    <lgl> <int>      <dbl>
##  1        1    FALSE    16    9.12500
##  2        1     TRUE     2    8.50000
##  3        2    FALSE    47   11.87234
##  4        2     TRUE    15   11.60000
##  5        3    FALSE   116    9.75000
##  6        3     TRUE    40    9.82500
##  7        4    FALSE    69   10.46377
##  8        4     TRUE    40   10.10000
##  9        5    FALSE    22   12.54545
## 10        5     TRUE    15    9.80000

Bar charts

The age distribution by sex

# initialize a plot of 'age'
sex_age <- ggplot(data = alc, aes(x=age, col=sex))

# draw a bar plot of age by sex
sex_age + geom_bar() + facet_wrap("sex")

high alcohol consumption by sex

# initialize a plot of 'high_use'
hu_sex <- ggplot(data = alc, aes(x=sex, col=sex))

# draw a bar plot of high_use by sex
hu_sex + geom_bar() + facet_wrap("high_use")

Here we can see that more females than males are not high consumers of alcohol; men consume more alcohol than their female counterparts.

High use vs freetime

# initialize a plot of 'high_use'
hu_ft <- ggplot(data = alc, aes(x=freetime, col=sex))

# draw a bar plot of 'high_use' by freetime
hu_ft + geom_bar() + facet_wrap("high_use")

Contrary to what one might expect, most students with more free time do not seem to be high consumers of alcohol.

High use vs absences

# initialize a plot of 'high_use'
hu_ab <- ggplot(data = alc, aes(x=absences, col=sex))

# draw a bar plot of 'high_use' by freetime
hu_ab + geom_bar() + facet_wrap("high_use")

A few of the absentees are high alcohol consumers, but many are not.

High use vs age

# initialize a plot of 'high_use'
hu_ag <- ggplot(data = alc, aes(x=age, col=sex))

# draw a bar plot of 'high_use' by freetime
hu_ag + geom_bar() + facet_wrap("high_use")

Some of the young students below 20 years are high consumers of alcohol, but they are fewer than those who are not high alcohol consumers.

High use vs romantic

# initialize a plot of 'high_use'
hu_rom <- ggplot(data = alc, aes(x=romantic, col=sex))

# draw a bar plot of 'high_use' by freetime
hu_rom + geom_bar() + facet_wrap("high_use")

Among students in romantic relationships, relatively few are high consumers of alcohol; the same holds for those not in romantic relationships.

BOXPLOTS

Below are boxplots to give a clearer picture, as explained earlier.

Relationship of alcohol consumption with absences

# initialise a plot of high_use and absences
h_ab<- ggplot(alc, aes(x=high_use, y=absences,col=sex))

# define the plot as a boxplot and draw it
h_ab + geom_boxplot() + ggtitle("Student absences vs alcohol consumption")

There are a few chronic absentees amongst male and female students, but generally, most of the students have low absences.

Relationship of alcohol consumption with age

# initialise a plot of high_use and age
h_ag<- ggplot(alc, aes(x=high_use, y=age,col=sex))

# define the plot as a boxplot and draw it
h_ag + geom_boxplot() + ggtitle("Students' age vs alcohol consumption")

Relationship of alcohol consumption with freetime

# initialise a plot of high_use and freetime
h_fr<- ggplot(alc, aes(x=high_use, y=freetime,col=sex))

# define the plot as a boxplot and draw it
h_fr + geom_boxplot() + ggtitle("Student's freetime vs alcohol consumption")

Relationship of alcohol consumption with sex

# initialise a plot of high_use and sex
alc_sex<- ggplot(alc, aes(y=alc_use, x=sex,col=sex))

# define the plot as a boxplot and draw it
alc_sex + geom_boxplot() + ggtitle("Student's alcohol consumption by sex")

There are a few female students with quite high alcohol consumption, but consumption levels vary less among female students than among male students.

Logistic regression

fitting the glm model

high_use_mod1<- glm(high_use~ age + sex + absences + freetime , data= alc, family = "binomial")
summary(high_use_mod1)
## 
## Call:
## glm(formula = high_use ~ age + sex + absences + freetime, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5188  -0.8214  -0.5957   1.0480   2.1082  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -5.35260    1.76384  -3.035 0.002408 ** 
## age          0.15609    0.10221   1.527 0.126698    
## sexM         0.89550    0.24773   3.615 0.000301 ***
## absences     0.07162    0.01802   3.974 7.08e-05 ***
## freetime     0.30118    0.12578   2.394 0.016644 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 462.21  on 381  degrees of freedom
## Residual deviance: 417.53  on 377  degrees of freedom
## AIC: 427.53
## 
## Number of Fisher Scoring iterations: 4
# coef(high_use_mod1)
# coefficients(high_use_mod1)

The predictors 'age' and 'freetime' are not significant. Hence, they could be considered redundant variables with no considerable effect on alcohol consumption, compared with 'absences' and 'sex', which are significant. I therefore fail to reject the null hypotheses that age and free time are not related to alcohol consumption. On the other hand, I reject the null hypotheses that sex and absences are not related to alcohol consumption; being male and having more absences both have positive effects on high alcohol consumption.

Refitting the model without the redundant variables

high_use_mod2<- glm(high_use~  sex + absences, data= alc, family = "binomial")
summary(high_use_mod2)
## 
## Call:
## glm(formula = high_use ~ sex + absences, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7368  -0.8501  -0.5838   1.0919   1.9899  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.83117    0.21956  -8.340  < 2e-16 ***
## sexM         0.99923    0.24179   4.133 3.59e-05 ***
## absences     0.07403    0.01811   4.089 4.34e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 462.21  on 381  degrees of freedom
## Residual deviance: 425.79  on 379  degrees of freedom
## AIC: 431.79
## 
## Number of Fisher Scoring iterations: 4
#options("contrasts")

Check the overall effect of the removed variables ('age' and 'freetime') by performing a likelihood ratio test.

anova(high_use_mod1, high_use_mod2, test="LRT")
## Analysis of Deviance Table
## 
## Model 1: high_use ~ age + sex + absences + freetime
## Model 2: high_use ~ sex + absences
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)  
## 1       377     417.53                       
## 2       379     425.79 -2  -8.2563  0.01611 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The likelihood ratio test actually indicates a significant difference between the two models (p ≈ 0.016), so dropping 'age' and 'freetime' worsens the fit somewhat.
Nevertheless, since neither variable was individually significant, I retain the simpler model with 'sex' and 'absences' for parsimony.

Let's also check whether 'sex' is significant in the prediction.

high_use_mod3<- glm(high_use~  absences, data= alc, family = "binomial")
anova(high_use_mod2, high_use_mod3, test="LRT")
## Analysis of Deviance Table
## 
## Model 1: high_use ~ sex + absences
## Model 2: high_use ~ absences
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)    
## 1       379     425.79                         
## 2       380     443.69 -1  -17.907 2.32e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The test shows that sex is important and would remain in the model.

Hence, final model is: high_use~ sex + absences

Calculating the odds ratios and confidence intervals

# compute odds ratios (odds_ra)
odds_ra <- exp(coef(high_use_mod2))
#odds_ra <- coef(high_use_mod2) %>% exp     #alternative

# compute confidence intervals (conf_int)
conf_int <- exp(confint(high_use_mod2)) 
#Conf_Int <- high_use_mod2 %>%  confint() %>% exp   #alternative

# print out the odds ratios with their confidence intervals
cbind(odds_ra, conf_int)
##               odds_ra     2.5 %    97.5 %
## (Intercept) 0.1602264 0.1022869 0.2424271
## sexM        2.7161808 1.7015022 4.3985356
## absences    1.0768398 1.0414538 1.1176941

Here we can see that male sex and absences are positively associated with high alcohol use. The confidence interval for male sex is quite wide compared with that of 'absences', which is narrower and therefore gives a more precise estimate of its effect on alcohol consumption.
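The odds ratios above are simply the exponentiated logistic regression coefficients; for example, using the estimates printed in the summary of high_use_mod2:

```r
# exp() turns log-odds coefficients into odds ratios
b_sexM     <- 0.99923    # coefficient of sexM from the model summary
b_absences <- 0.07403    # coefficient of absences
exp(b_sexM)       # about 2.72: males have roughly 2.7 times the odds of high use
exp(b_absences)   # about 1.08: each extra absence multiplies the odds by ~1.08
```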

Use the fitted model for prediction

# predict() the probability of high_use
probs<- predict(high_use_mod2, type = "response")

add the predicted probabilities to ‘alc’

alc$prob_high_use <- probs
#alc <- mutate(alc, prob_high_use = probs) #alternative

use the probabilities to make a prediction of high_use, setting 0.5 as threshold

alc$predict_high_use<- (alc$prob_high_use)>0.5
#alc <- mutate(alc, predict_high_use = prob_high_use > 0.5) #alternative

see the first ten and last ten original classes, predicted probabilities, and class predictions

head(alc[,c("failures", "absences", "sex", "high_use", "prob_high_use", "predict_high_use")], n=10)
##    failures absences sex high_use prob_high_use predict_high_use
## 1         0        6   F    FALSE     0.1998897            FALSE
## 2         0        4   F    FALSE     0.1772568            FALSE
## 3         3       10   F     TRUE     0.2514562            FALSE
## 4         0        2   F    FALSE     0.1566846            FALSE
## 5         0        4   F    FALSE     0.1772568            FALSE
## 6         0       10   M    FALSE     0.4771074            FALSE
## 7         0        0   M    FALSE     0.3032349            FALSE
## 8         0        6   F    FALSE     0.1998897            FALSE
## 9         0        0   M    FALSE     0.3032349            FALSE
## 10        0        0   M    FALSE     0.3032349            FALSE
select(alc, failures, absences, sex, high_use, prob_high_use, predict_high_use) %>% tail(10)
##     failures absences sex high_use prob_high_use predict_high_use
## 373        1        0   M    FALSE     0.3032349            FALSE
## 374        1       14   M     TRUE     0.5509447             TRUE
## 375        0        2   F    FALSE     0.1566846            FALSE
## 376        0        7   F    FALSE     0.2119931            FALSE
## 377        1        0   F    FALSE     0.1380993            FALSE
## 378        0        0   F    FALSE     0.1380993            FALSE
## 379        1        0   F    FALSE     0.1380993            FALSE
## 380        1        0   F    FALSE     0.1380993            FALSE
## 381        0        3   M     TRUE     0.3520937            FALSE
## 382        0        0   M     TRUE     0.3032349            FALSE

tabulate the target variable versus the predictions

table(high_use = alc$high_use, prediction = alc$predict_high_use)
##         prediction
## high_use FALSE TRUE
##    FALSE   263    7
##    TRUE     89   23

The model correctly classified 263 non-high-use and 23 high-use observations. It misclassified 89 actual high users as non-high users (false negatives) and 7 non-high users as high users (false positives).
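The usual classification metrics can be read off the 2x2 table above; a short sketch with the counts copied from the output (base R only):

```r
# Counts copied from the confusion matrix above:
tn <- 263; fp <- 7; fn <- 89; tp <- 23
accuracy    <- (tp + tn) / (tp + tn + fp + fn)   # share of correct predictions
sensitivity <- tp / (tp + fn)                    # high users correctly caught
specificity <- tn / (tn + fp)                    # non-high users correctly caught
round(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity), 3)
```

The accuracy looks decent (~0.75), but the sensitivity (~0.21) shows the model misses most actual high users, which the 0.5 threshold and the class imbalance both contribute to.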

Plotting the predicted and observed alcohol consumption

# initialize a plot of 'high_use' versus 'probability' in 'alc'
g <- ggplot(alc, aes(x = prob_high_use, y = high_use, col= predict_high_use))

# define the geom as points and draw the plot
g + geom_point()

#There were quite many wrong predictions

tabulate the target variable versus the predictions

conf_mat<-table(high_use = alc$high_use, prediction = alc$predict_high_use)
conf_mat<-prop.table(conf_mat)
addmargins(conf_mat)
##         prediction
## high_use      FALSE       TRUE        Sum
##    FALSE 0.68848168 0.01832461 0.70680628
##    TRUE  0.23298429 0.06020942 0.29319372
##    Sum   0.92146597 0.07853403 1.00000000
#Alternatively, this can be done as shown below:
#addmargins(prop.table(table(high_use = alc$high_use, prediction = alc$predict_high_use)))
#table(high_use = alc$high_use, prediction = alc$predict_high_use) %>%  prop.table() %>% addmargins()

mean error of the prediction

mean(abs(alc$high_use-alc$prob_high_use)>0.5)
## [1] 0.2513089

My model has a slightly lower mean prediction error (0.25) than the DataCamp model (0.26).

Below is an alternative way that first defines a function to calculate the mean prediction error.

# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}
# call loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = alc$prob_high_use)
## [1] 0.2513089
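As a sanity check, this training error can be compared with a trivial baseline classifier that always predicts FALSE (no high use); the counts come from the confusion matrix above:

```r
# A classifier that always predicts FALSE errs exactly on the
# 112 actual high users out of the 382 students:
baseline_error <- 112 / 382
round(baseline_error, 4)
```

The model's error of 0.2513 only modestly beats this 0.2932 baseline, which puts the headline number in perspective.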

K-fold cross-validation

library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = high_use_mod2, K = 10)

# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2539267
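Because the K folds are drawn at random, cv.glm() gives a slightly different error on every run. A hedged sketch on simulated data (all variable names here are illustrative, not from the alc analysis) shows how repeating and averaging steadies the estimate:

```r
library(boot)
set.seed(2)
# simulated logistic-style data, purely for illustration
d <- data.frame(x = rnorm(200))
d$y <- rbinom(200, 1, plogis(d$x))
fit <- glm(y ~ x, data = d, family = "binomial")
loss <- function(class, prob) mean(abs(class - prob) > 0.5)
# repeat 10-fold CV a few times and average the errors
errs <- replicate(5, cv.glm(d, fit, cost = loss, K = 10)$delta[1])
round(mean(errs), 2)
```

The spread of `errs` gives a feel for how much a single 10-fold estimate (like the 0.2539 above) can wobble.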

This is the link to my data wrangling


Chapter 4: Clustering and classification

1. Loading the libraries

#Get access to the libraries
library(MASS)
library(ggplot2)
library(GGally)
library(tidyr)
#install.packages("plotly")
library(tidyverse)
library(corrplot)
library(dplyr)
library(plotly)

2. Data and brief overview

data("Boston")
colnames(Boston)
##  [1] "crim"    "zn"      "indus"   "chas"    "nox"     "rm"      "age"    
##  [8] "dis"     "rad"     "tax"     "ptratio" "black"   "lstat"   "medv"
#structure of the data
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
#dimension of the data
dim(Boston)
## [1] 506  14
glimpse(Boston)
## Observations: 506
## Variables: 14
## $ crim    <dbl> 0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, ...
## $ zn      <dbl> 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5,...
## $ indus   <dbl> 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7.87, ...
## $ chas    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ nox     <dbl> 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, 0.524...
## $ rm      <dbl> 6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, 6.172...
## $ age     <dbl> 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 100.0,...
## $ dis     <dbl> 4.0900, 4.9671, 4.9671, 6.0622, 6.0622, 6.0622, 5.5605...
## $ rad     <int> 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, ...
## $ tax     <dbl> 296, 242, 242, 222, 222, 222, 311, 311, 311, 311, 311,...
## $ ptratio <dbl> 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 15.2, ...
## $ black   <dbl> 396.90, 396.90, 392.83, 394.63, 396.90, 394.12, 395.60...
## $ lstat   <dbl> 4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.9...
## $ medv    <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, ...

The data includes information about Housing Values in Suburbs of Boston. The data has 14 variables and 506 observations. The relevant information in the dataset is described below:

Variable Definition
crim per capita crime rate by town
zn proportion of residential land zoned for lots over 25,000 sq.ft
indus proportion of non-retail business acres per town
chas Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
nox nitrogen oxides concentration (parts per 10 million)
rm average number of rooms per dwelling
age proportion of owner-occupied units built prior to 1940
dis weighted mean of distances to five Boston employment centres
rad index of accessibility to radial highways
tax full-value property-tax rate per $10,000
ptratio pupil-teacher ratio by town
black 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
lstat lower status of the population (percent)
medv median value of owner-occupied homes in $1000s

Show a graphical overview of the data and show summaries of the variables in the data. Describe and interpret the outputs, commenting on the distributions of the variables and the relationships between them. (0-2 points)

3. Graphical overview of the data

  • Below are the graphical representations, which show the distributions of and correlations between the variables.
#graphical overview of the data
#ggpairs(Boston)
#pairs(Boston)
#plot(Boston)

#summary of the data
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00
# calculate the correlation matrix and round it
cor_matrix<-cor(Boston)
cor_matrix
##                crim          zn       indus         chas         nox
## crim     1.00000000 -0.20046922  0.40658341 -0.055891582  0.42097171
## zn      -0.20046922  1.00000000 -0.53382819 -0.042696719 -0.51660371
## indus    0.40658341 -0.53382819  1.00000000  0.062938027  0.76365145
## chas    -0.05589158 -0.04269672  0.06293803  1.000000000  0.09120281
## nox      0.42097171 -0.51660371  0.76365145  0.091202807  1.00000000
## rm      -0.21924670  0.31199059 -0.39167585  0.091251225 -0.30218819
## age      0.35273425 -0.56953734  0.64477851  0.086517774  0.73147010
## dis     -0.37967009  0.66440822 -0.70802699 -0.099175780 -0.76923011
## rad      0.62550515 -0.31194783  0.59512927 -0.007368241  0.61144056
## tax      0.58276431 -0.31456332  0.72076018 -0.035586518  0.66802320
## ptratio  0.28994558 -0.39167855  0.38324756 -0.121515174  0.18893268
## black   -0.38506394  0.17552032 -0.35697654  0.048788485 -0.38005064
## lstat    0.45562148 -0.41299457  0.60379972 -0.053929298  0.59087892
## medv    -0.38830461  0.36044534 -0.48372516  0.175260177 -0.42732077
##                  rm         age         dis          rad         tax
## crim    -0.21924670  0.35273425 -0.37967009  0.625505145  0.58276431
## zn       0.31199059 -0.56953734  0.66440822 -0.311947826 -0.31456332
## indus   -0.39167585  0.64477851 -0.70802699  0.595129275  0.72076018
## chas     0.09125123  0.08651777 -0.09917578 -0.007368241 -0.03558652
## nox     -0.30218819  0.73147010 -0.76923011  0.611440563  0.66802320
## rm       1.00000000 -0.24026493  0.20524621 -0.209846668 -0.29204783
## age     -0.24026493  1.00000000 -0.74788054  0.456022452  0.50645559
## dis      0.20524621 -0.74788054  1.00000000 -0.494587930 -0.53443158
## rad     -0.20984667  0.45602245 -0.49458793  1.000000000  0.91022819
## tax     -0.29204783  0.50645559 -0.53443158  0.910228189  1.00000000
## ptratio -0.35550149  0.26151501 -0.23247054  0.464741179  0.46085304
## black    0.12806864 -0.27353398  0.29151167 -0.444412816 -0.44180801
## lstat   -0.61380827  0.60233853 -0.49699583  0.488676335  0.54399341
## medv     0.69535995 -0.37695457  0.24992873 -0.381626231 -0.46853593
##            ptratio       black      lstat       medv
## crim     0.2899456 -0.38506394  0.4556215 -0.3883046
## zn      -0.3916785  0.17552032 -0.4129946  0.3604453
## indus    0.3832476 -0.35697654  0.6037997 -0.4837252
## chas    -0.1215152  0.04878848 -0.0539293  0.1752602
## nox      0.1889327 -0.38005064  0.5908789 -0.4273208
## rm      -0.3555015  0.12806864 -0.6138083  0.6953599
## age      0.2615150 -0.27353398  0.6023385 -0.3769546
## dis     -0.2324705  0.29151167 -0.4969958  0.2499287
## rad      0.4647412 -0.44441282  0.4886763 -0.3816262
## tax      0.4608530 -0.44180801  0.5439934 -0.4685359
## ptratio  1.0000000 -0.17738330  0.3740443 -0.5077867
## black   -0.1773833  1.00000000 -0.3660869  0.3334608
## lstat    0.3740443 -0.36608690  1.0000000 -0.7376627
## medv    -0.5077867  0.33346082 -0.7376627  1.0000000
# visualize the correlation matrix
corrplot(cor_matrix, method="circle", type = "upper")

From the above, we can see both positive and negative correlations. For instance, industrial land use and nitrogen oxide concentration are positively correlated, and industrial land use is also positively correlated with tax. Crime is positively correlated with the index of accessibility to radial highways. “age” and “dis” are negatively correlated, and “dis” and “tax” are slightly negatively correlated too. The correlation plot reveals other weak to strong negative and positive correlations between the variables.

4. Standardising the dataset and others:

Linear discriminant analysis will be performed later, so the entire dataset needs to be scaled. This is done by subtracting the column mean from each column and dividing the difference by the column's standard deviation.
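A minimal sketch on a toy vector confirms that scale() performs exactly this (x - mean) / sd transform:

```r
x <- c(2, 4, 6, 8)
manual <- (x - mean(x)) / sd(x)   # subtract the mean, divide by the sd
auto   <- as.vector(scale(x))     # scale() does the same per column
all.equal(manual, auto)
```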

  • center and standardize variables
boston_scaled <- scale(Boston)
  • summaries of the scaled variables
summary(boston_scaled)
##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865
  • class of the boston_scaled object
class(boston_scaled)
## [1] "matrix"
  • change the object to data frame
boston_scaled<-as.data.frame(boston_scaled)
  • summary of the scaled crime rate
summary(boston_scaled$crim)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -0.419367 -0.410563 -0.390280  0.000000  0.007389  9.924110
  • create a quantile vector of crim and print it
bins <- quantile(boston_scaled$crim)
bins
##           0%          25%          50%          75%         100% 
## -0.419366929 -0.410563278 -0.390280295  0.007389247  9.924109610
  • create a categorical variable ‘crime’
  • Here, the per capita crime rate (crim) is cut at its quantiles and converted from a continuous to a categorical variable. It is further made a factor variable to derive the different levels of crime.
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, label=c("low", "med_low", "med_high", "high"))
  • look at the table of the new factor crime
table(crime)
## crime
##      low  med_low med_high     high 
##      127      126      126      127
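The quantile-cut used above can be seen on a toy vector (the vector 1:100 is purely illustrative):

```r
v <- 1:100
bks <- quantile(v)                # 0%, 25%, 50%, 75%, 100% breakpoints
cats <- cut(v, breaks = bks, include.lowest = TRUE,
            labels = c("low", "med_low", "med_high", "high"))
table(cats)                       # four categories of 25 values each
```

include.lowest = TRUE is needed so the minimum value falls into the first bin instead of becoming NA, just as in the crime variable above.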

To avoid confusion, I will delete the old continuous variable “crim” in favour of the newly created categorical variable “crime” - remove original crim from the dataset

boston_scaled <- dplyr::select(boston_scaled, -crim)
  • add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
  • Sampling: here, the data is split into training and testing sets so the predictive performance of my model can be assessed. The model is trained on the train data and evaluated on the test data.
#number of rows in the Boston dataset 
n <- nrow(boston_scaled)
# choose randomly 80% of the rows
ind <- sample(n,  size = n * 0.8)

# create train set
train <- boston_scaled[ind,]

# create test set 
test <- boston_scaled[-ind,]
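Note that sample() is random, so this 80/20 split differs between runs; set.seed() makes it reproducible, as this minimal sketch shows:

```r
set.seed(42)
ind1 <- sample(10, 8)   # same call ...
set.seed(42)
ind2 <- sample(10, 8)   # ... with the same seed gives the same indices
identical(ind1, ind2)
```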

5. linear discriminant analysis

Here, crime is the target variable and all the other variables are the predictors. Linear discriminant analysis (LDA), a generalization of Fisher’s linear discriminant, is a method for finding a linear combination of features that separates or characterises two or more classes of objects or events. It is used in pattern recognition, machine learning and statistical learning. See more here.

Put differently, it is a classification (and dimension reduction) method which finds the (linear) combination of the variables that separate the target variable classes. The target can be binary or multiclass variable.

Linear discriminant analysis is akin to many other methods, such as principal component analysis. LDA can be visualized with a biplot.
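As a minimal, self-contained illustration of the target ~ predictors formula interface (using the built-in iris data rather than the Boston set; only MASS is assumed):

```r
library(MASS)
fit <- lda(Species ~ ., data = iris)   # target ~ all other variables
pred <- predict(fit, iris)$class       # most probable class per observation
mean(pred == iris$Species)             # in-sample accuracy
```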

lda.fit <- lda(crime~., data = train)

# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2524752 0.2599010 0.2549505 0.2326733 
## 
## Group means:
##                  zn      indus        chas        nox         rm
## low       0.9766777 -0.8765504 -0.07933396 -0.8920830  0.4300768
## med_low  -0.1001209 -0.2808571 -0.04735191 -0.5460669 -0.1326688
## med_high -0.3773415  0.1704345  0.22458650  0.4125047  0.1679526
## high     -0.4872402  1.0172896 -0.06290884  1.0794274 -0.3827428
##                 age        dis        rad        tax     ptratio
## low      -0.8966127  0.8687272 -0.7003842 -0.7296461 -0.47804684
## med_low  -0.3215100  0.3320423 -0.5421724 -0.4712768 -0.04632703
## med_high  0.3829106 -0.3682981 -0.3864524 -0.2987114 -0.38172202
## high      0.8511585 -0.8720402  1.6363892  1.5128120  0.77875205
##                black       lstat         medv
## low       0.37927726 -0.75745652  0.536777824
## med_low   0.31543074 -0.12059250  0.003785183
## med_high  0.06741356 -0.06139186  0.211253653
## high     -0.84461172  0.91678672 -0.699669583
## 
## Coefficients of linear discriminants:
##                  LD1          LD2         LD3
## zn       0.095854405  0.746811561 -0.89958901
## indus    0.001087597 -0.148543850  0.21695259
## chas    -0.084770418 -0.033675848  0.06076634
## nox      0.439173661 -0.895705742 -1.25949698
## rm      -0.082778051 -0.077859887 -0.17486744
## age      0.262551345 -0.275129591 -0.07413945
## dis     -0.049441082 -0.317132113  0.14508692
## rad      2.983877625  0.884780538 -0.11676629
## tax     -0.035915294  0.010957487  0.45659400
## ptratio  0.135203751  0.017834765 -0.12784108
## black   -0.135730532  0.006233904  0.13759036
## lstat    0.225677308 -0.079462141  0.52134479
## medv     0.204283253 -0.352178261 -0.06568365
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9434 0.0421 0.0146

LDA uses the trained model to calculate the probability of each observation belonging to the various classes and then assigns each observation to its most probable class.

To visualise the result, the function from the DataCamp exercise will be used. It was originally derived from a Stack Overflow comment here.

  • the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}
  • target classes as numeric
train$crime <- as.numeric(train$crime)
  • plot the lda results
plot(lda.fit, dimen = 2, col = train$crime, pch= train$crime)
lda.arrows(lda.fit, myscale = 2)

The above arrows depict the relationship between the original variables and the LDA solution.

6. The predictive performance of the LDA

  • save the correct classes from test data
correct_classes <- test[,"crime"]
  • remove the categorical crime variable from test data
test <- dplyr::select(test, -crime)
  • predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)
  • cross tabulate the results
table(correct = correct_classes , predicted = lda.pred$class )
##           predicted
## correct    low med_low med_high high
##   low       14      10        1    0
##   med_low    3      17        1    0
##   med_high   1       8       14    0
##   high       0       0        0   33

The predictive performance for the high class is very good, while the other classes are predicted rather poorly; in particular, many low and med_high observations were wrongly predicted as med_low.

7. Clustering

  • reload the Boston dataset
# load MASS and Boston
library(MASS)
data('Boston')
  • standardise the data to get comparable distances later
boston_standard <- scale(Boston)
  • For calculating the distances between observations, I’ll use the euclidean distance. There are other methods too; by default, the dist() function in R uses the euclidean distance, so there is no need to specify it, but doing so can be useful for clarity.
  • euclidean distance matrix
dist_eu <-dist(boston_standard, method= "euclidean")
  • look at the summary of the distances
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970
  • now, I’ll try out the manhattan distance too
  • manhattan distance matrix
dist_man <- dist(boston_standard, method="manhattan")
  • look at the summary of the distances
summary(dist_man)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2662  8.4832 12.6090 13.5488 17.7568 48.8618
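The two metrics can be contrasted on a single pair of points (same dist() calls as above, base R only):

```r
m <- rbind(c(0, 0), c(3, 4))
d_eu  <- as.numeric(dist(m, method = "euclidean"))  # sqrt(3^2 + 4^2) = 5
d_man <- as.numeric(dist(m, method = "manhattan"))  # |3| + |4| = 7
c(euclidean = d_eu, manhattan = d_man)
```

Manhattan distances are always at least as large as euclidean ones, which is consistent with the larger summary values above.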

k-means clustering

K-means is a popular clustering method. It is an unsupervised method that assigns observations to groups or clusters based on the similarity of the objects. The distance matrix is computed automatically by the kmeans() function. source

  • run the k-means algorithm on the dataset

km <-kmeans(Boston, centers = 3)
  • plot the Boston dataset with clusters
pairs(Boston[6:8], col = km$cluster)
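Before tuning k on the Boston data, a toy illustration of kmeans() on two obviously separated groups (simulated points, purely illustrative):

```r
set.seed(1)
# two groups of 10 points each, centred at 0 and at 10
pts <- rbind(matrix(rnorm(20, mean = 0),  ncol = 2),
             matrix(rnorm(20, mean = 10), ncol = 2))
km_toy <- kmeans(pts, centers = 2)
table(km_toy$cluster)   # two clean clusters of 10 points each
```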

  • determine the optimal number of clusters (K-means: determine the k)

K-means needs the number of clusters as an argument. There are many ways to look at the optimal number of clusters and a good way might depend on the data you have.

One way to determine the number of clusters is to look at how the total of within cluster sum of squares (WCSS) behaves when the number of cluster changes. When you plot the number of clusters and the total WCSS, the optimal number of clusters is when the total WCSS drops radically.

K-means might produce different results every time, because it randomly assigns the initial cluster centers. The function set.seed() can be used to deal with that. source

set.seed(123)
k_max <- 10

# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(Boston, k)$tot.withinss})

# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')

From the plot, we can see that the total WCSS drops drastically at about 2 clusters, hence the centers will be set to 2.

  • k-means clustering

km <-kmeans(Boston, centers = 2)
  • plot the Boston dataset with clusters
pairs(Boston, col = km$cluster)

From the above, it is hard to see the variables, so I’ll select some of them for plotting.

pairs(Boston[4:7], col = km$cluster)

  • Bonus: Perform k-means on the original Boston data with some reasonable number of clusters (> 2). Remember to standardize the dataset. Then perform LDA using the clusters as target classes. Include all the variables in the Boston data in the LDA model. Visualize the results with a biplot (include arrows representing the relationships of the original variables to the LDA solution). Interpret the results. Which variables are the most influencial linear separators for the clusters? (0-2 points to compensate any loss of points from the above exercises)
boston_standard2<-scale(Boston)

set.seed(123)

# determine the number of clusters
k_max <- 10

# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(boston_standard2, k)$tot.withinss})

# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')

# k-means clustering
km <-kmeans(boston_standard2, centers = 7)

#convert km to dataframe
boston_standard2<-as.data.frame(boston_standard2)

lda.fit_clus<- lda(km$cluster~., data=boston_standard2)

# plot the Boston dataset with clusters
pairs(boston_standard2[,3:7], col = km$cluster)

# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes (the k-means clusters) as numeric
classes <- km$cluster

# plot the lda results
plot(lda.fit_clus, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit_clus, myscale = 2)

  • Super-Bonus: Run the code below for the (scaled) train data that you used to fit the LDA. The code creates a matrix product, which is a projection of the data points.
model_predictors <- dplyr::select(train, -crime)


# check the dimensions
dim(model_predictors)
## [1] 404  13
dim(lda.fit$scaling)
## [1] 13  3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)

#Next, install and access the plotly package. Create a 3D plot (Cool!) of the columns of the matrix product by typing the code below.
library(plotly)
  • using the plotly package for 3D plotting of the matrix products’ columns.
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers')
  • Adjust the code: add argument color as a argument in the plot_ly() function. Set the color to be the crime classes of the train set. Draw another 3D plot where the color is defined by the clusters of the k-means. How do the plots differ? Are there any similarities? (0-3 points to compensate any loss of points from the above exercises)
#convert the numeric crime in the training data back to its class labels
train$crime <- factor(train$crime, labels = c("low", "med_low", "med_high", "high"))
 

plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color=train$crime)

I got a better visualisation by using the crime classes as the colour representation.

  • Lastly, I replotted using the k-means clustering as the colour representation.
  • Here, I use the LDA fit for the k-means clusters which I derived earlier (i.e. lda.fit_clus)
# check the dimensions
dim(model_predictors)
## [1] 404  13
dim(lda.fit_clus$scaling)
## [1] 14  6
# matrix multiplication; the cluster LDA uses all 14 scaled Boston
# variables as predictors, so the full scaled data is multiplied here
matrix_product <- as.matrix(boston_standard2) %*% lda.fit_clus$scaling
matrix_product <- as.data.frame(matrix_product)



plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color=km$cluster)


Chapter 5: Dimensionality reduction techniques

Load packages and data

rm(list = ls())
# Access GGally
library(GGally)
library(corrplot)
library(tidyr)
library(ggplot2)
library(FactoMineR)
#install.packages("FactoMineR")

#the file path of the data
fpath<- "C:/Users/oyeda/Desktop/OPEN_DATA_SCIENCE/IODS-project/data/assignment5/human.txt"

#load the data
human<- read.table(fpath, sep=",", header = T)

Description of the dataset

dim(human)  #the dimension of the data
## [1] 155   8
str(human)    #the structure of the data
## 'data.frame':    155 obs. of  8 variables:
##  $ Edu2.FM  : num  1.007 0.997 0.983 0.989 0.969 ...
##  $ Labo.FM  : num  0.891 0.819 0.825 0.884 0.829 ...
##  $ Edu.Exp  : num  17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
##  $ Life.Exp : num  81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
##  $ GNI      : int  64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
##  $ Mat.Mor  : int  4 6 6 5 6 7 9 28 11 8 ...
##  $ Ado.Birth: num  7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
##  $ Parli.F  : num  39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...

Description of the data set: data originally from: Modified and analysed by Oyedayo Oyelowo, 2017. The data used in this analysis consists of various development indicators for many countries across the globe. It has 8 variables and 155 observations. All the variables are numerical (all floating-point except the Gross National Income per capita and the Maternal mortality ratio, which are integers).

  • Below are the column names and their full meanings
  • “Country” = Country name

  • Health and knowledge

  • “GNI” = Gross National Income per capita
  • “Life.Exp” = Life expectancy at birth
  • “Edu.Exp” = Expected years of schooling
  • “Mat.Mor” = Maternal mortality ratio
  • “Ado.Birth” = Adolescent birth rate

  • Empowerment

  • “Parli.F” = Percentage of female representatives in parliament
  • “Edu2.F” = Proportion of females with at least secondary education
  • “Edu2.M” = Proportion of males with at least secondary education
  • “Labo.F” = Proportion of females in the labour force
  • “Labo.M” = Proportion of males in the labour force
  • “Edu2.FM” = Edu2.F / Edu2.M
  • “Labo.FM” = Labo.F / Labo.M

visualize the ‘human’ variables

ggpairs(human)

# compute the correlation matrix and visualize it with corrplot
cor(human)%>%corrplot()

summary(human)
##     Edu2.FM          Labo.FM          Edu.Exp         Life.Exp    
##  Min.   :0.1717   Min.   :0.1857   Min.   : 5.40   Min.   :49.00  
##  1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:11.25   1st Qu.:66.30  
##  Median :0.9375   Median :0.7535   Median :13.50   Median :74.20  
##  Mean   :0.8529   Mean   :0.7074   Mean   :13.18   Mean   :71.65  
##  3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:15.20   3rd Qu.:77.25  
##  Max.   :1.4967   Max.   :1.0380   Max.   :20.20   Max.   :83.50  
##       GNI            Mat.Mor         Ado.Birth         Parli.F     
##  Min.   :   581   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  4198   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 12040   Median :  49.0   Median : 33.60   Median :19.30  
##  Mean   : 17628   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
##  3rd Qu.: 24512   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :123124   Max.   :1100.0   Max.   :204.80   Max.   :57.50

Here, we can see that life expectancy at birth and expected years of schooling have a fairly high positive correlation. Maternal mortality ratio and life expectancy at birth have a strong negative correlation. Also, maternal mortality ratio has a quite high correlation of 0.76 with adolescent birth rate. The graph above gives more detail. Maternal mortality ratio, although generally on the low side, is very high in some countries, with a maximum of 1100, far above the mean of 149.1. The summary and graph above show more information about the distribution of this and the other variables.

  • Some information from the IODS datacamp exercise for reference purposes:
  • PCA with R: Principal Component Analysis (PCA) can be performed by two slightly different matrix decomposition methods from linear algebra: the Eigenvalue Decomposition and the Singular Value Decomposition (SVD). There are two functions in the default package distribution of R that can be used to perform PCA: princomp() and prcomp(). The prcomp() function uses the SVD and is the preferred, more numerically accurate method. Both methods quite literally decompose a data matrix into a product of smaller matrices, which lets us extract the underlying principal components. This makes it possible to approximate a lower dimensional representation of the data by choosing only a few principal components.
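The prcomp()/SVD connection noted above can be checked directly on toy data: the component standard deviations reported by prcomp() are the singular values of the centered data matrix divided by sqrt(n - 1).

```r
set.seed(1)
X <- matrix(rnorm(50), nrow = 10)   # 10 observations, 5 variables

pca <- prcomp(X)                    # centers by default, uses the SVD
sv  <- svd(scale(X, center = TRUE, scale = FALSE))

# sdev equals the singular values scaled by sqrt(n - 1)
all.equal(pca$sdev, sv$d / sqrt(nrow(X) - 1))
```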

principal component analysis on the unstandardised data

pca_unstd_human<- prcomp(human)

#save the summary of the analysis into an object
s_hum<-summary(pca_unstd_human)

#print the summary
s_hum
## Importance of components:
##                              PC1      PC2   PC3   PC4   PC5   PC6    PC7
## Standard deviation     1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912
## Proportion of Variance 9.999e-01   0.0001  0.00  0.00 0.000 0.000 0.0000
## Cumulative Proportion  9.999e-01   1.0000  1.00  1.00 1.000 1.000 1.0000
##                           PC8
## Standard deviation     0.1591
## Proportion of Variance 0.0000
## Cumulative Proportion  1.0000

get the proportion of variance, which is in the second row of the importance table, and convert it to percentages

pca_prop<- round(100*s_hum$importance[2,], digits = 3)
pca_prop
##   PC1   PC2   PC3   PC4   PC5   PC6   PC7   PC8 
## 99.99  0.01  0.00  0.00  0.00  0.00  0.00  0.00

create object pc_label to be used as axis labels

pc_label<-paste0(names(pca_prop), " (", pca_prop, "%)")

draw a biplot for the PCA of the unstandardised data

biplot(pca_unstd_human, cex = c(0.8, 1), col = c("grey40", "red"), xlab = pc_label[1], ylab = pc_label[2])

#deeppink2

PCA on Standardised dataset

scale the data “human”

#scale the data "human"
human_sca<- scale(human)

#check the summary to see that the variables have all been scaled
summary(human_sca)
##     Edu2.FM           Labo.FM           Edu.Exp           Life.Exp      
##  Min.   :-2.8189   Min.   :-2.6247   Min.   :-2.7378   Min.   :-2.7188  
##  1st Qu.:-0.5233   1st Qu.:-0.5484   1st Qu.:-0.6782   1st Qu.:-0.6425  
##  Median : 0.3503   Median : 0.2316   Median : 0.1140   Median : 0.3056  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5958   3rd Qu.: 0.7350   3rd Qu.: 0.7126   3rd Qu.: 0.6717  
##  Max.   : 2.6646   Max.   : 1.6632   Max.   : 2.4730   Max.   : 1.4218  
##       GNI             Mat.Mor          Ado.Birth          Parli.F       
##  Min.   :-0.9193   Min.   :-0.6992   Min.   :-1.1325   Min.   :-1.8203  
##  1st Qu.:-0.7243   1st Qu.:-0.6496   1st Qu.:-0.8394   1st Qu.:-0.7409  
##  Median :-0.3013   Median :-0.4726   Median :-0.3298   Median :-0.1403  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.3712   3rd Qu.: 0.1932   3rd Qu.: 0.6030   3rd Qu.: 0.6127  
##  Max.   : 5.6890   Max.   : 4.4899   Max.   : 3.8344   Max.   : 3.1850

perform a principal component analysis (SVD) on the scaled data

pca_human_sca<- prcomp(human_sca)

s_hum2<-summary(pca_human_sca)
s_hum2
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6
## Standard deviation     2.0708 1.1397 0.87505 0.77886 0.66196 0.53631
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595
## Cumulative Proportion  0.5361 0.6984 0.79413 0.86996 0.92473 0.96069
##                            PC7     PC8
## Standard deviation     0.45900 0.32224
## Proportion of Variance 0.02634 0.01298
## Cumulative Proportion  0.98702 1.00000

get the proportion of variance, which is in the second row of the importance table, and convert it to percentages

pca_prop2<- round(100*s_hum2$importance[2,], digits = 3)
pca_prop2
##    PC1    PC2    PC3    PC4    PC5    PC6    PC7    PC8 
## 53.605 16.237  9.571  7.583  5.477  3.595  2.634  1.298

The biplot for the standardised data (compared with the unstandardised)

# create object pc_label2 to be used as axis labels
pc_label2<-paste0(names(pca_prop2), " (", pca_prop2, "%)")

par(mfrow=c(1,2))
# draw a biplot
biplot(pca_human_sca, cex = c(0.7, 0.7), col = c("grey", "red"),
       xlab = pc_label2[1], ylab = pc_label2[2], main = "Standardised")

biplot(pca_unstd_human, cex = c(0.8, 1), col = c("grey40", "red"), xlab = pc_label[1], ylab = pc_label[2],
       main="Unstandardised")

- We can see that the unstandardised results differ from the standardised ones. In the unstandardised analysis, GNI has far more leverage because its values are huge and were not scaled to the same level as the other variables. This was addressed by scaling the data before the analysis was performed. Here, the proportion of females in the labour force and the percentage of female representatives in parliament are almost orthogonal (at right angles) to, and therefore less related to, maternal mortality ratio, adolescent birth rate, expected years of schooling and life expectancy at birth.

  • Also, while the percentage of female representatives in parliament and the female-to-male labour force ratio mostly contributed to the second principal component, the others contributed to the first, as shown in the biplot.
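The effect of scaling seen above can be reproduced on toy data: an unscaled variable with a huge spread (like GNI here) soaks up nearly all of the variance in PC1, while after scale() the variance is shared. The variable names are made up for illustration.

```r
set.seed(42)
toy <- data.frame(small = rnorm(100),             # spread around 1
                  big   = rnorm(100, sd = 1000))  # huge spread, like GNI

# proportion of variance per component (second row of the importance table)
prop_var <- function(p) summary(p)$importance[2, ]

prop_var(prcomp(toy))         # PC1 carries essentially all the variance
prop_var(prcomp(scale(toy)))  # variance split roughly evenly
```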

Loading the tea dataset and exploring the structure

library(FactoMineR)   #needed for the tea dataset and the MCA() function
data(tea)     #load the tea dataset
str(tea)
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
##  $ frequency       : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
dim(tea)
## [1] 300  36
colnames(tea)
##  [1] "breakfast"        "tea.time"         "evening"         
##  [4] "lunch"            "dinner"           "always"          
##  [7] "home"             "work"             "tearoom"         
## [10] "friends"          "resto"            "pub"             
## [13] "Tea"              "How"              "sugar"           
## [16] "how"              "where"            "price"           
## [19] "age"              "sex"              "SPC"             
## [22] "Sport"            "age_Q"            "frequency"       
## [25] "escape.exoticism" "spirituality"     "healthy"         
## [28] "diuretic"         "friendliness"     "iron.absorption" 
## [31] "feminine"         "sophisticated"    "slimming"        
## [34] "exciting"         "relaxing"         "effect.on.health"

Subsetting the data and exploring its structure and distributions

# column names to keep in the dataset
keep_columns <- c("Tea", "How", "how", "sugar", "where","lunch", "dinner", "always")

# select the 'keep_columns' to create a new dataset
tea_time <- tea[, keep_columns]
#tea_time <- select(tea, keep_columns)

# look at the summaries and structure of the data
summary(tea_time)
##         Tea         How                      how           sugar    
##  black    : 74   alone:195   tea bag           :170   No.sugar:155  
##  Earl Grey:193   lemon: 33   tea bag+unpackaged: 94   sugar   :145  
##  green    : 33   milk : 63   unpackaged        : 36                 
##                  other:  9                                          
##                   where           lunch            dinner   
##  chain store         :192   lunch    : 44   dinner    : 21  
##  chain store+tea shop: 78   Not.lunch:256   Not.dinner:279  
##  tea shop            : 30                                   
##                                                             
##         always   
##  always    :103  
##  Not.always:197  
##                  
## 
str(tea_time)
## 'data.frame':    300 obs. of  8 variables:
##  $ Tea   : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How   : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ how   : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ where : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ lunch : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner: Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always: Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...

visualize the tea dataset

#bar plot of the variables
gather(tea_time) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() 

multiple correspondence analysis of tea data

mca <- MCA(tea_time, graph = T)

# table of eigenvalues
#mca$eig

# summary of the model
summary(mca)
## 
## Call:
## MCA(X = tea_time, graph = T) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6
## Variance               0.210   0.197   0.169   0.154   0.142   0.132
## % of var.             12.953  12.143  10.375   9.504   8.722   8.096
## Cumulative % of var.  12.953  25.096  35.471  44.975  53.698  61.794
##                        Dim.7   Dim.8   Dim.9  Dim.10  Dim.11  Dim.12
## Variance               0.123   0.109   0.105   0.097   0.085   0.058
## % of var.              7.544   6.726   6.440   5.952   5.218   3.596
## Cumulative % of var.  69.338  76.064  82.504  88.456  93.674  97.271
##                       Dim.13
## Variance               0.044
## % of var.              2.729
## Cumulative % of var. 100.000
## 
## Individuals (the 10 first)
##                       Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## 1                  | -0.272  0.117  0.087 | -0.322  0.175  0.123 |  0.323
## 2                  | -0.243  0.094  0.048 | -0.144  0.035  0.017 |  0.588
## 3                  | -0.192  0.059  0.017 | -0.483  0.394  0.108 |  0.400
## 4                  | -0.321  0.163  0.047 | -0.500  0.423  0.115 |  0.089
## 5                  | -0.300  0.142  0.131 | -0.205  0.071  0.061 | -0.046
## 6                  | -0.192  0.059  0.017 | -0.483  0.394  0.108 |  0.400
## 7                  | -0.328  0.171  0.209 | -0.290  0.142  0.163 |  0.189
## 8                  | -0.243  0.094  0.048 | -0.144  0.035  0.017 |  0.588
## 9                  |  0.076  0.009  0.004 |  0.742  0.930  0.398 |  0.151
## 10                 |  0.361  0.206  0.101 |  0.567  0.543  0.249 |  0.643
##                       ctr   cos2  
## 1                   0.207  0.124 |
## 2                   0.683  0.281 |
## 3                   0.317  0.074 |
## 4                   0.016  0.004 |
## 5                   0.004  0.003 |
## 6                   0.317  0.074 |
## 7                   0.070  0.069 |
## 8                   0.683  0.281 |
## 9                   0.045  0.016 |
## 10                  0.817  0.320 |
## 
## Categories (the 10 first)
##                        Dim.1     ctr    cos2  v.test     Dim.2     ctr
## black              |   0.424   2.629   0.059   4.192 |   0.081   0.103
## Earl Grey          |  -0.255   2.481   0.117  -5.918 |   0.132   0.707
## green              |   0.540   1.907   0.036   3.284 |  -0.952   6.314
## alone              |  -0.005   0.001   0.000  -0.121 |  -0.265   2.899
## lemon              |   0.653   2.781   0.053   3.967 |   0.573   2.289
## milk               |  -0.371   1.718   0.037  -3.309 |   0.305   1.236
## other              |   0.317   0.179   0.003   0.965 |   1.514   4.356
## tea bag            |  -0.594  11.887   0.462 -11.752 |  -0.382   5.238
## tea bag+unpackaged |   0.328   1.997   0.049   3.826 |   1.016  20.501
## unpackaged         |   1.951  27.132   0.519  12.459 |  -0.850   5.490
##                       cos2  v.test     Dim.3     ctr    cos2  v.test  
## black                0.002   0.802 |   1.047  20.041   0.359  10.358 |
## Earl Grey            0.031   3.058 |  -0.417   8.277   0.313  -9.674 |
## green                0.112  -5.787 |   0.089   0.064   0.001   0.540 |
## alone                0.131  -6.252 |   0.071   0.242   0.009   1.671 |
## lemon                0.041   3.484 |  -1.032   8.681   0.132  -6.272 |
## milk                 0.025   2.717 |  -0.082   0.105   0.002  -0.732 |
## other                0.071   4.604 |   2.822  17.710   0.246   8.581 |
## tea bag              0.191  -7.553 |   0.049   0.102   0.003   0.975 |
## tea bag+unpackaged   0.471  11.871 |   0.062   0.089   0.002   0.725 |
## unpackaged           0.098  -5.426 |  -0.395   1.387   0.021  -2.522 |
## 
## Categorical variables (eta2)
##                      Dim.1 Dim.2 Dim.3  
## Tea                | 0.118 0.112 0.383 |
## How                | 0.079 0.170 0.361 |
## how                | 0.691 0.493 0.021 |
## sugar              | 0.055 0.001 0.261 |
## where              | 0.722 0.667 0.054 |
## lunch              | 0.001 0.084 0.103 |
## dinner             | 0.016 0.031 0.031 |
## always             | 0.002 0.021 0.134 |
#dimdesc(mca)
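As a side note (my own sketch, not from the course material): MCA can be thought of as correspondence analysis of the complete indicator (dummy) matrix of the factors, and base R can build that matrix with model.matrix(). The two toy factors below only mimic columns of tea_time.

```r
# two toy factors mimicking columns of tea_time
tt <- data.frame(sugar = factor(c("sugar", "No.sugar", "sugar")),
                 where = factor(c("chain store", "tea shop", "tea shop")))

# full indicator matrix: one 0/1 column per category, no level dropped
ind <- model.matrix(~ . - 1, data = tt,
                    contrasts.arg = lapply(tt, contrasts, contrasts = FALSE))

# each row sums to the number of variables, since every row picks
# exactly one category per variable
rowSums(ind)
```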

visualize MCA

plot(mca, invisible=c("ind"), habillage="quali")

#The below is in case I want to highlight the categories that contribute the most
#For instance, the call below labels only the 3 categories with the highest contribution
plot(mca, invisible=c("ind"), habillage="quali", selectMod = "contrib 3")

mca
## **Results of the Multiple Correspondence Analysis (MCA)**
## The analysis was performed on 300 individuals, described by 8 variables
## *The results are available in the following objects:
## 
##    name              description                       
## 1  "$eig"            "eigenvalues"                     
## 2  "$var"            "results for the variables"       
## 3  "$var$coord"      "coord. of the categories"        
## 4  "$var$cos2"       "cos2 for the categories"         
## 5  "$var$contrib"    "contributions of the categories" 
## 6  "$var$v.test"     "v-test for the categories"       
## 7  "$ind"            "results for the individuals"     
## 8  "$ind$coord"      "coord. for the individuals"      
## 9  "$ind$cos2"       "cos2 for the individuals"        
## 10 "$ind$contrib"    "contributions of the individuals"
## 11 "$call"           "intermediate results"            
## 12 "$call$marge.col" "weights of columns"              
## 13 "$call$marge.li"  "weights of rows"
#plot.MCA(mca)

The above plot looks a bit flavorless. Hence, I’ll create a more interesting plot using ggplot2; other packages could be used as well.

Getting a better biplot for MCA using ggplot

## number of categories per variable
categ = apply(tea_time, 2, function(x) nlevels(as.factor(x)))

categ
##    Tea    How    how  sugar  where  lunch dinner always 
##      3      4      3      2      3      2      2      2
rep("c", 3)
## [1] "c" "c" "c"
# data frame with variable coordinates
mca_vars_df = data.frame(mca$var$coord, Variable = rep(names(categ), categ))
mca_vars_df
##                             Dim.1       Dim.2       Dim.3        Dim.4
## black                 0.423651485  0.08107960  1.04681861 -0.585421744
## Earl Grey            -0.254815184  0.13167636 -0.41657241  0.206151622
## green                 0.540276385 -0.95192208  0.08890598  0.107089275
## alone                -0.005145665 -0.26532782  0.07089725 -0.008821771
## lemon                 0.652506444  0.57311193 -1.03170232  0.716796890
## milk                 -0.371173973  0.30477825 -0.08213745 -0.590809805
## other                 0.317183601  1.51391127  2.82176366  1.698551747
## tea bag              -0.594335503 -0.38199371  0.04930828  0.023211198
## tea bag+unpackaged    0.327592844  1.01630733  0.06206211  0.172077242
## unpackaged            1.951203005 -0.84983218 -0.39489575 -0.558921233
## No.sugar              0.227855056  0.02967102  0.49368134 -0.371161035
## sugar                -0.243569198 -0.03171729 -0.52772833  0.396758347
## chain store          -0.517954077 -0.36948069  0.07302572 -0.106496997
## chain store+tea shop  0.409644442  1.33003920  0.08904367  0.193338346
## tea shop              2.249830542 -1.09342548 -0.69887815  0.178901081
## lunch                -0.059801438  0.69847346 -0.77513934 -0.702690710
## Not.lunch             0.010278372 -0.12005013  0.13322707  0.120774966
## dinner                0.464256677 -0.63716529  0.64640998  2.706238251
## Not.dinner           -0.034944051  0.04795868 -0.04865451 -0.203695352
## always                0.068164509  0.19941295 -0.50712645 -0.325285068
## Not.always           -0.035639312 -0.10426159  0.26514733  0.170072904
##                            Dim.5 Variable
## black                 0.71760141      Tea
## Earl Grey            -0.02900587      Tea
## green                -1.43952642      Tea
## alone                -0.43351279      How
## lemon                 0.98168642      How
## milk                  0.58564594      How
## other                 1.69373872      How
## tea bag               0.22592334      how
## tea bag+unpackaged   -0.65394909      how
## unpackaged            0.64067352      how
## No.sugar             -0.33971157    sugar
## sugar                 0.36313996    sugar
## chain store          -0.01455597    where
## chain store+tea shop -0.06007963    where
## tea shop              0.24936521    where
## lunch                -0.41109609    lunch
## Not.lunch             0.07065714    lunch
## dinner               -0.46752843   dinner
## Not.dinner            0.03519031   dinner
## always               -0.08536366   always
## Not.always            0.04463176   always
# data frame with observation coordinates
mca_obs_df = data.frame(mca$ind$coord)


# MCA plot of observations and categories
ggplot(data = mca_obs_df, aes(x = Dim.1, y = Dim.2)) +
  geom_hline(yintercept = 0, colour = "gray70") +
  geom_vline(xintercept = 0, colour = "gray70") +
  geom_point(colour = "gray50", alpha = 0.7) +
  geom_density2d(colour = "gray80") +
  geom_text(data = mca_vars_df, 
            aes(x = Dim.1, y = Dim.2, 
                label = rownames(mca_vars_df), colour = Variable)) +
  ggtitle("MCA plot of variables using R package FactoMineR") +
  scale_colour_discrete(name = "Variable")

To spice the graph up, I have decided to overlay the categories on the observations. I also used geom_density2d() to show density curves where individuals overlap. These work like isolines, highlighting the areas with the highest concentration of observations.